Skip to content

Fix URL decoding for robots.txt #58

Merged
lfoppiano merged 15 commits into
mainfrom
bugfix/robot-txt-url
May 5, 2026
Merged

Fix URL decoding for robots.txt #58
lfoppiano merged 15 commits into
mainfrom
bugfix/robot-txt-url

Conversation

@lfoppiano

@lfoppiano lfoppiano commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

This PR addresses the problem found in the April Crawl for processing robots.txt URLs that resulted in an empty/null hostname.

Background information: The target URLs of robots.txt redirects are not passed to URL filters before the redirects are followed. RFC 9309 just specifies to follow five levels of redirect and does not require or even recommend normalization of redirect targets. As a consequence, the robots.txt URLs stemming from redirects may have an unexpected form.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent empty/null hostnames when parsing malformed robots.txt-style URLs in WARC target URIs by normalizing malformed http(s) scheme slashes and adding regression tests.

Changes:

  • Normalize malformed http(s):////... inputs before parsing in WarcUri.
  • Add JUnit tests covering malformed scheme slashes and extra path slashes.
  • Ignore IntelliJ project metadata via .gitignore.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.

File Description
src/main/java/org/commoncrawl/net/WarcUri.java Normalizes malformed http(s) URLs before parsing; adds warning log on parse failure.
src/test/java/org/commoncrawl/net/WarcUriTest.java Adds regression tests for hostname extraction from malformed/oddly-slashed URLs.
.gitignore Adds IntelliJ IDEA project directory ignore entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
Comment thread src/main/java/org/commoncrawl/net/WarcUri.java Outdated
Comment thread src/main/java/org/commoncrawl/net/WarcUri.java Outdated
Comment thread src/main/java/org/commoncrawl/net/WarcUri.java Outdated

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @lfoppiano. See inline comments.

Comment thread src/main/java/org/commoncrawl/net/WarcUri.java Outdated
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
Comment thread src/test/java/org/commoncrawl/net/WarcUriTest.java Outdated
@lfoppiano

Copy link
Copy Markdown
Contributor Author

@sebastian-nagel I've tighten the application of the normalization only if the hostname is blank. There is a comment in test getHostNameMalformedHttpShouldNotBeEmpty: This is obsolete: if it's expected to be "www.google.com" it cannot be empty.. I'm not sure I understand it. I might have forgotten how it was working before, though.

@lfoppiano lfoppiano requested a review from sebastian-nagel May 2, 2026 21:21
@sebastian-nagel

Copy link
Copy Markdown
Contributor

comment in test getHostNameMalformedHttpShouldNotBeEmpty: This is obsolete: if it's expected to be "www.google.com" it cannot be empty.. I'm not sure I understand it. I might have forgotten how it was working before, though.

I meant: If there's an assertion for a string to by "xyz" it cannot be empty. One assertion is enough, which means less code but not less tests.

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks good.

@lfoppiano lfoppiano merged commit cfa1554 into main May 5, 2026
7 checks passed
@lfoppiano lfoppiano deleted the bugfix/robot-txt-url branch May 5, 2026 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants